Hmm? This isn't supposed to look like just Ah, sod that. Okay, there. Group K_. Da. Yeah. Yeah, I guess. Whatever. It's yeah, I never get this, so what's the point of if this is strapless if this strap's supposed to go behind m I obviously doing this wrong. Obviously failing this like whatev calling centre and tra Like this and then yeah, you know what the shit has to do now. Okay. Yes, so um actually I just realise I don't have power, let me just switch on the other which is gonna run out. Okay. So there's probably not much to talk about at the moment in terms of like talking about each other's stuff, I mean beyond what we've been talking about yesterday. So what I've been thinking is maybe if we try to really make sense together of the X_M_L_ format to be sure that we can all produce the data that we are producing in a way or at least I mean I guess my data I'll probably load in from completely different way anyway because it's matrix, but all this stuff that goes with annotation, that we have it in the right format. Maybe if we just sort of pool what we know about the X_M_L_ format and try to make sense of it. I don't at all know how to go about this. Um what I did just now is I downloaded Um I downloaded some files, some X_M_L_ files and I was thinking if we maybe just should go through some of them and like try to see if you understand the general structure and pool together what we think about the structure, what we think that things mean. And then talk about how the annotation for information density and stuff should maybe be structured in a similar way. Or but I don't r I don't really have an idea like what, other words, what we would do today in this meeting. I mean What's that part? Mm-hmm. Mm-hmm. Yes, so does anyone have a good idea where to start like which what so what is it we talk about getting in at the moment? We're basically just 'cause the two of you will merge your data together, right? 
In some way like i in the end it'll result in one annotation. So the only thing we're trying to tie in is just an additional annotation which is similar to the annotation about what what would it be s most similar to. Simply to the segmentation information or Okay. I'm not sure it Ah. See, that's what I was thinking. Not sure what software I have at the moment, which would be hello? Um Yeah. Ah. Does anyone know any good standards software displays X_M_L_, I'm not sure. Yeah, but I'm afraid I only have Firefox and Firefox doesn't have the nice X_M_L_ viewer anymore. I think I don't have Mozilla.. No. I can probably just Emacs them. But But going from from a word business, wouldn't it be easier to then go back and calculate for each utterance? You know, wha wha each turn, so if we Yeah. So d Does anyone does anyone know in which file the actual, like what's it called, the splitting into the utterances? Is is called is that Is it What does it end with? Does it with with SAX? Does that look like it? Okay. Be nice. Let's actually just see if maybe K_ right. Oh this is how much time you spend just getting the right software going. Is that better to see? I guess yeah, white and black is better to see. So just see if I can view Okay. So This this But what is this? I mean this doesn't contain any content. Like any text. Oh wait, somebody s s somebody explain that slowly to me. So So what do all these things in this file have in common? What what are they? No but w like uh so they are they are segments in what sense? Wh what's definition of a segment then? Okay. And and that's usually dialogue what what's difference i between provenance dialogue act and between segmenter? I don't A dialogue act uh a dialogue act annotations of dialogue act. So is that just another representation of text? Just see, can we have this files somewhere. So the stuff that's that's called segment here that do we know which file this is pulling from? 
From yeah, it's it's c it from words X_M_L_, is that it's saying here? Let me see if I have that words X_M_L_ file. So let's for example look at Okay, let me just try to get them all to a different screen. Uh uh is that what we were Sorry, I'm just I'm trying to arrange them, so that we have one above the other, so to make some sense of that. So no. Here we had Um I'm just trying to understand what this segments file is about. Like it's it's referencing here it's referencing to To a certain position in the in the words file. So it's that non-vocal sound one here. That's the mic noise, okay. But what's the does it always just oh no, it's it's referencing to several word structures. So this would be basically be a a sequence, right? So that's from W_ fifty two. Are they not ordered or oh yeah, from W_ fifty two down to W_ what are we talking about? And yeah, w sorry. Okay. So this so this is somebody saying it it doesn't. Do we have any confirmation do we actually see from this file who's doing it, just out of interest? Oh. Sorry, in speaker D_'s file. So this is where he'd like time stops. And then it only starts at a later time, because oh, you don't see when I'm pointing at screen, it's just for this. So then it starts here this time against speaker D_ because in-between there's somebody else. Okay. Yeah. Okay. Okay. So the the file with annotations for information density, what would that be like? Would that be like the segments file, giving a start word and end word, or or would that How would that be to best tie in with the system? Sorry? Something like something like this or like where you give sort of a reference to a beginning and an end position. Well that would be in the same file, wouldn't it? It would just couldn't we have a like uh I guess we could introduce a different um We have for now if we just do something which for every segment in here attaches actually can we not tie it to the segments file? 
Would that not be the way how they would want it to have to do as if we in the end, if we want to attach to each of those segments one number for now. And those segments have uh do they have an I_D_? Yeah, I guess they do. So this is the I_D_ of a specific segment, right? Yeah, so wouldn't it I mean the problem is they've me haven't looked at the exact inner workings of their of their engineered. But you'd think that the easiest way and the way that how it's intended to be would be just, if we have here a link to the segment, a like an I_D_ for that segment, that we just create another file which links to the segment and then has an additional value which is the number. Yeah, yeah. Yeah, oh sorry, I I thought we were talking about having two files. I was think yeah, I I think that I I understood wrong. I thought you were wanting to have two different X_M_L_ files, with one the reference and one just th just the number. And I was thinking that's probably what you meant, just um having like for each sort of of our segments having just the I_D_, which is referencing to these segments here and another attribute which is the the value. So this whole information we would then store in this Ah, no, that I don't have access to because I didn't download like there's this one meta information file where it describes the structure of all the files and describes which um which attributes they bring in. So we would add that to that file? Saying that sort of we bring in information density. And then we would create the file of that type, which we probably couldn't call it segment. I'm not sure. We probably might have to have a different word for it. I'm not sure if it there's any trouble with it repeating. In the existing segments file. Yeah, I reckon actually if we make a copy of it What uh what are you saying about the w I th doesn't the lazy loading apply to everything? I mean that it sort of dynamically loads. 
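A minimal sketch of the linking-file idea discussed above: a separate file in which each entry carries the I_D_ of an existing segment plus one extra attribute holding the value. The element name `density`, the attribute names `segment` and `infodensity`, and the example I_D_s are all hypothetical; the real names would have to match whatever the metadata file declares, which has not been checked here.

```python
import xml.etree.ElementTree as ET


def build_density_file(segment_ids, values, attr="infodensity"):
    """Return XML text with one element per segment: a reference plus a value.

    Element and attribute names are placeholders, not the corpus schema.
    """
    root = ET.Element("densities")
    for seg_id, value in zip(segment_ids, values):
        # Each entry only points at an existing segment by its ID and
        # stores the number; it does not copy the segment's content.
        ET.SubElement(root, "density", {"segment": seg_id, attr: f"{value:.3f}"})
    return ET.tostring(root, encoding="unicode")


# Hypothetical segment IDs and scores, for illustration only.
xml_text = build_density_file(["seg.1", "seg.2"], [0.25, 0.8])
print(xml_text)
```

Keeping the values in their own file leaves the existing segments file untouched, which seems to be the intent of the discussion.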
You think it i it l it loads the whole of the segments file every time. Okay. The segment at the moment is split up over Hmm. No but I think it's too early to really like discuss that in detail, because we don't at all understand at the moment how the internal data structure like how the loading works. So maybe if we g if we go ahead, do you think it would be possible for you to do an like something like segments or maybe just a copy of segment which has an attribute for for each segment, um but it is like with a with a value with a density value. Would it be easier though, because all your methods are sort of not working with their whole time frame structure there. So would it be easy for you to to tie the things together. Like if you're doing it on on the word basis here with those words, that in the end you then tie it back in into the right segment here. I mean you probably have to do a bit of Hmm. You're doing it time-based at the moment. So this isn't directly having time references, but you can get the oh, it is actually here. Okay. So if you slice it up by time, you'd probably be able to just like attach like just some attribute of info val just f to each of those segments. Mm-hmm. 'Kay. Um I don't understand enough of what their data structure is. This probably would be a lot easier would I really understand how they are handling the data internally, 'cause then we c then I could say oh, it's easy to just tie it in if you just have it time-stamped that just reference by words and stuff. But t at the moment So you would you're doing i you're doing a word by word base. Mm-hmm. Okay. But every word does does the word picking, like does it always have the same information value in your thing i does it depend on its position like It be d depends on the position. Hmm. 
And you're doing this via some software that's like external software, so you can't So what if you if you if you get the results from that software and you go back over it then with that file sort of, you write an algorithm which which then goes back because they they're in the right order still and stuff, right? So that shouldn't be a problem. Oh but this is by speaker here which makes it slightly more difficult. The problem is that actually NITE X_M_L_ probably provides a lot of the tools that we'd need to do that. In in a temporal sequence. But but would the information density algorithm still make any sense if you split them up, because I mean the whole I thought the whole thing is that you look at the frequency of a word in that specific how often a word inc occurs in a certain topic versus how often it occurs over the whole corpus and that from that it calculates. Mm-hmm. Yeah, I have a gut feeling it's not a good idea to split it. Yeah. No, but I mean it can't be that difficult. If you if you already have you have in the right order all the words with a with a score to them, and you have a file which has each word and a time stamp. So those two tied together have each word and its time and and and and its proba and its value. Yeah. So they're they're at the moment so It appears to me that Rainbow was made for something quite different. Hmm. I actually like um um i are you actually sure that Rainbow is doing a measure, like is returning a measure of what we are trying to measure, 'cause it it seems to me that it's just it sounds like something quite different in in d many aspects. Mm-hmm. And w Yeah. Yeah. But where do you have the original category information from? Yeah. Yeah. So where do you where do you have them from at the moment the split up? In i in the t in the topics, in the in the human topic s um so you've split them up by topic at the moment. Yeah, but Okay okay okay. 
'Cause I was just thinking that doesn't make sense at all, but yeah, if you if that's just while you're waiting. Okay. Have you ever like looked into different ways of calculating, 'cause I was just thinking like I mean the for example the the infor um what's it called, the entropy calculation, is that she boxed it under the simple calculation that you could probably write the script in no time at all and Yeah, but I'm also just like I think that probably the entropy value at the moment for a word is closer to what we're at the moment looking for. I can just like k k I can sit together with you for twenty minutes and just show you the entropy code that I wrote for my other project and it probably and we should work together because it's we used t we have to we'd use the same matrix as I'm using in my latent semantic analysis, you'd use to calculate entropy scores. And then we'd have um a score which actually which would be the same for the word in each position. So in that sense it's doing something a bit different, and like w basically the score that I'm talking about is a conditional entropy score which just checks how much information, the fact that there's one word tells you about what would be the next word. But that's a relatively good measure of whether that's a very specific word, in which case they are usually words which tell you quite a lot or a very general word which usually doesn't tell you quite a lot. Um it it's basically the it's the standard entropy formula. And you sort of you you Yes, yes, it has. I mean, I think the official le sort of the official description of what it tells you is um how much that like the fact that a given word occurs tells you about what's the next word's gonna be. 
Which doesn't sound too exciting, but it it just works out in the way that words which are promiscuous and which occur with everything all over the place have very low scores on that, and also usually end up being the words which are least expressive and contain less information. Yeah, function words or just very general nouns. Pr probably like what whatev for example the word computer in that context. You could imagine it to f like be in all in all the contexts. Yeah, so we would we w wouldn't do it by word, we would sorry, okay, um I was I was getting that w actually, sorry, I was getting that wrong, I was getting it from what I did in my project. Now in this case, we would do it by words per meeting. So Hmm. Yes, it it's probably doing it's probably doing quite the same thing in the end, but I'm just saying like with that thing you would easily have an algorithm which at the moment provides you for each word with a score which we can use. Um no um, I was I was I was describing the wrong thing. In this case we wouldn't be doing how much it tells you about another word. In this case we would be doing given that you know a word, how good is it at predicting from which um specific topic that was. So that would yeah, in that sense it's the same thing here. Yeah, yeah, for a specific Yeah. The same word the word yesterday would be would have the same score all over the place. Okay. I c But but your category thing depends on that we not just have topic segments, but also that these topic segments we have them in categories. You you do wouldn't you need several documents for each category? Or several segments for each category. But will it work without that at all? 'Cause Yeah, I mean s so in our case basically every every s topic would be its own category. And the question is does the algorithm still make any sense in that I don't understand the algorithm enough for that. 
But what I'm really like because the entropical um calculation is so simple, maybe we should look into making that score just as a preliminary score that we have. Like it it's a very it gives you like I've looked at the result, it gives you basically something in the end which vaguely tells you just whether a word is a very specific word or a very general word. And I like there is some hope that probably having just sentences where there's lots of very specific words, if you mark them as being more interesting than the words which are only very general words, that they would get us somewhere. Probably one point zero is very high information value. Yeah, 'cause this would Yeah, in in a sense I mean this is a bit like the what like document frequency over total frequency. Measure it sort of just going by What do you mean every sequence of the same Mm-hmm. Within th within the topic, so like topic we had to when you say topic, you mean like just like from from a beginning to end point, like within one meeting there are several topics? But we don't have that information anywhere, do we? But but you you are segmenting. Well, I'm I'm doing one on segment similarity in the end, yeah. I'm doing like finding similar segments, basic latent semantic analysis. But But like f for now, like your segmentation is just splitting a meeting up into different blocks ver Not not from what Colin is doing from what I no. It's only like I'm writing an algorithm that which then tries to also again based on word p occurrence patterns try to link together maybe different ones of those. So So that yeah. Yeah, I'm also a bit b like I'm not a hundred percent sure about Rainbow being the right thing, 'cause it seems that Rainbow does in its structure quite rely on having different examples of the same category sort of in in a way. 'Kay. 
You know what, as a byproduct of my L_S_A_ I'll provide um a vocabulary like sort of a dictionary which for each word gives an entropy score, which just tells you of how much information the presence of a word tells you about which topic it is. Not which category, like I'm not I'm not lumping together separate topic segments into categories. But just like how much this word tells you about wh how likely that w the occurrence of that word makes it that it's a specific segment topic segment. Which is some measure already of how widespread this word is versus how specific f to a certain segment that is. And I'll just provide that, because that's just not much more work than just the usual thing. And then we can see how we can tie that in with the other stuff. So if you um keep on working on Rainbow meanwhile and try to find a way how to tie your Rainbow stuff into some way that we can attach it to a certain time segment. I'm just thinking, wh if It it used is each word completely unique, like sort of does it treat each word, each occurrence of word, as a completely unique event. Or does it, I mean, no, it it has to, I mean, basically the the form of the word is important, right? We can't just replace the word by an arbitrary string. Because it looks if the same word occurs again and stuff. Yeah, it it has to work with a m yeah, well I think it's a stupid question, like it it it has to work on on on the word, like on What I was thinking is whether if we replace the word by something uniquely id identifiable, then it wouldn't make a difference which order it is. But that wouldn't work because it needs the word, because that's all it's working on. It's the word and that looks if that word occurs again and versus how often that word occurs in other context, right? So we can't attach some type of information to the word, just to the word string itself, like making an underscore, making the time or something, that wouldn't work. 
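The per-word dictionary described above could be sketched roughly as follows, assuming the score is the standard Shannon entropy of a word's distribution over topic segments. Rescaling it so that one point zero means "occurs in only one segment" is an assumption made here to match the remark that one point zero is a high information value; the function name and the toy segments are hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_specificity(segments):
    """Map each word to a score in [0, 1]; 1.0 means it occurs in one segment only.

    `segments` is a list of token lists, one list per topic segment.
    """
    counts = defaultdict(Counter)
    for seg_index, tokens in enumerate(segments):
        for token in tokens:
            counts[token][seg_index] += 1

    # Largest possible entropy: a word spread evenly over all segments.
    max_entropy = math.log2(len(segments)) if len(segments) > 1 else 1.0

    scores = {}
    for word, per_segment in counts.items():
        total = sum(per_segment.values())
        # Standard entropy formula over p(segment | word).
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in per_segment.values())
        scores[word] = 1.0 - entropy / max_entropy
    return scores


# Toy data: three "topic segments" of one meeting.
segments = [["the", "remote", "button"], ["the", "budget", "the"], ["the", "remote"]]
scores = word_specificity(segments)
print(sorted(scores.items(), key=lambda item: item[1]))
```

As said in the discussion, the same word gets the same score wherever it occurs, so this is a lookup table rather than a position-dependent measure.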
But would it have that in the untruncated version then, like would it s would the output be the untruncated version? It will probably um no, I don't think it would. Yeah, for now, I mean really just like Yeah, I think if we work together on an on an entropy based score. It's let me see if I can demonstrate T I mean, let's just keep on talking meanwhile and I'll try to start that up. It's it's it's really d it it's a very simple thing, but it it basically just does something which tells you how specific a word is. In a sense it's in a it's it's basically just to to a high degree really telling you how r how rare a word is or how common a word is. But uh as as the first step that's probably for for a prototype for next week that's probably not a bad thing, I mean even if it's just that. Even if you just like if you have segments where where lots of rare words occur, highlighted in darker red than segments where all the very common words occur. That's just that's somewhere to start from. And it's it's a bit more sophisticated than that, but then de facto it just ends up doing that mostly, from what I figured out. Um Actually I think I'm not gonna not gonna start that now, because that's probably gonna take too long. So if we get it on a word by word basis whatever you do it'll probably appear in a word by word basis, and what you have on a sort of segment but not quite segment base. 'Kay, what about the following model? I mean this is a very unscientific way of doing it in some sense, but what if we if we take time as the standard unit for now and sort of like make a massive one segment split super-array. Because everything you're doing can in one way or the other be ti tied down to actual time. So if we if we if for if for each one segment time slot we could attach a value, and then it would be easy to then go back when if we have the time marks here, and re-map that onto onto the length of a segment, you know what I mean? 
So if you have for each word say and we know that word starts at this segment and ends at that segment, and you have for like for some time period w the overlap and that period on the F_ ones and that period or something. So you also have a value which can be tied down to a time. And then we could just in m in Matlab or in something just create some massive super-array of of things for each for each like sort of time sampling slot and and calculate a value for this, and once we have this array, we can go with the script and sort of go for each segment to the starting and end times and say okay, this is from our time segment f here to this time segment there. So we take the sum of those and divide them by by the number or something, create the average and put it in as the value for the segment. You know what I mean? Oh. Tha that's that's a fiddling play in the end, but that problem we'll always have. Because we don't have a useful way of automatically evaluating um We don't have a way of usefully evaluating automatically what's good and what's bad, so it's it's probably always gonna be a question of looking at it and saying okay, like running it with different different factor loadings and seeing okay, this way it works this well and this way it works that well. But it probably be more difficult for mapped t to map for you to map it on so for you sort of i Yeah, but so you say that inst like you basically say w having an array where each each cell is is one like is o is one word. And then you would map your information onto individual words. Hmm? Would you be able to find out which word that is and 'Kay, and and then we could have like some type of just point in the end where the one scores from the all the individual word cells get multiplied all with like for each utterance or whatever. You have get all multiplied with the same value all the ones that are within that utterance. That's sort of the combination of of the two scores. 
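The time-slot idea above can be sketched like this: one value per fixed-length time slot, then for each segment the mean of the slots between its start and end time. The slot length, the segment I_D_s and the numbers are made up for illustration, and rounding the segment boundaries outwards to whole slots is an assumption.

```python
import math


def segment_means(slot_values, slot_length, segments):
    """Average the per-slot values that fall inside each segment.

    slot_values: list with one value per time slot, starting at time 0.
    slot_length: duration of one slot in seconds.
    segments:    dict mapping segment ID to (start_time, end_time) in seconds.
    """
    means = {}
    for seg_id, (start, end) in segments.items():
        first = int(start / slot_length)
        # Always cover at least one slot, even for very short segments.
        last = max(first + 1, math.ceil(end / slot_length))
        window = slot_values[first:last]
        means[seg_id] = sum(window) / len(window) if window else 0.0
    return means


# Hypothetical one-second slots and two segments.
slot_values = [0.0, 1.0, 1.0, 3.0, 5.0]
segments = {"seg.1": (1.0, 3.0), "seg.2": (3.0, 5.0)}
means = segment_means(slot_values, 1.0, segments)
print(means)
```

Because every score is first tied down to time, word-based and utterance-based values can share the same array before being mapped back onto segments.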
And then we'd have to go back again and then put that back into that segment mode here. So that we because in the end we don't want it on a per word basis, but probably on a per segment base. Oh, yeah yeah, actually that yeah, that's true, so that it's easier if you are able to yeah well, I think that's probably back where we started at this. But you said it wi but you said it's more complicated, because your segments aren't those segments exactly. Hmm. And their segments do overlap. Uh the tho those do or those don't? Mm-hmm. Okay. So you could on their granularity you could on their granularity create a score like for each for each of their segments. Well I guess I mean for you, if you know for each word, if you find that out, then it has to be possible, because if we know sort of this is going from word to word or it this is going from time to time and then there has to be a way then for you say okay, this concerns these following words, and then just make make a simple mean over them. I think an interesting thing is if we don't combine your two scores in the in the X_M_L_ file you had, but if we do that in the software, then we can probably make ways of playing with it in the software and sort of l you know like adapting some sort of control, like playing around with look like it's playing with different weightings for that the utterance-based one versus the word-based one and sort of look at it dynamically. You know what I mean? Like playing around, sort of figuring out what's the best way of combining them by playing around and looking at the results. Yeah. Yeah, we could probably like make a uh um graphic display initially, at least for our experimenting. We'd just place them in different ways and then see how they interact with each other. Okay. Yeah. 
So do you think y like both of you then can map something onto their segments, like just each of you provide one value, like double value or whatever, like one decimal value or whatever onto onto exactly their segments. It's stated both in It's done both in terms of words and in terms of segments. It's a bit sad sort of that we do this before we've truly figured out how the NITE X_M_L_ thing works, because now we're doing it all by hand and like parsing and un-parsing that thing and it's it's all part of the framework. Uh How much easier would it be if we truly understood this. So 'Kay, I might just change my order of in which I do things and like forget my latent semantic analysis stuff until the weekend and try to really make sense of the of the NITE data system now, so that maybe as soon as I've understood that we find ways of doing that in within the NITE framework already, so that we don't manually have to parse times and entire things together. Well, at the moment what you would do like to to solve this problem is you would sort of like write some Perl script or something that gets this time value out of here and Okay. Mm-hmm. Mm-hmm. So provided that you get your words in the right order, do you think it is an easy task for you to it it's a f relatively feasible task for you to to get just a single value per segment? Okay. And you say you think you're able as well to map onto those segments. And if you're both able to map into those segments, then we should be able to get one file where we have like whatever two values, value A_ and value B_ both as attributes for for this. And that we could load into a prototype and see what types of disp what ways of displaying this information are there. Hmm. Mm-hmm. Hmm. There's probably also social interaction factors in that there's sometimes just a meeting like if people adapt their F_ zero to each other, then they're sometimes p Yeah. 
Can you not do something like just like not measuring the F_ zero or the amplitude at all, but just like the variance of F_ zero within a certain time frame and like sort of like just have some part for its very low variance with more the same F_ zero and one where there's a lot of more variance. I don't well actually I don't know about that at all. Yeah. Sorry. Mm. Hmm.. Okay. So at the moment you you just you measuring the F_ zeros relative to the average for the speaker or Mm-hmm. Mm-hmm. Mm-hmm. Ri so the average Okay. Um so what you would be feeding in would be just like one value per speaker per meeting, so that oka that that's that's your average baseline, okay okay. No yeah, that that makes m yeah, that makes a lot more sense, yeah. So that would show you how much relative to how he sort of how he's performing generally in that meeting relative to that how he's in a specific segment. He or she. Yeah. Yeah. And for you that would be quite easy to translate into those segments. Yeah, guess I've asked this question fifteen times now. Sorry. Uh Hmm. Oh they w they do exist. The Castrati corps of the International Computer Science institute. Oh well. I'm afraid I have to go soon. But not quite sure, I mean is there anything more we have to talk about anyway? Yeah. See, as soon as as soon as I'm halfway through my L_S_A_ like basically as soon as I have the the matrix built, if of the document by word stuff, it's very easy to then calculate for each word a score, and I can just give you those scores and you can do with them whatever you want. Yeah, I think Colin, Dave and me will actually work on on the Java stuff, and it and we'll just see whatever you whatever you supply us, we'll try to tie in and visualise in some way or another. Um I'll ask Jonathan if we can postpone the meeting to one o'clock. So that would give us a chance of meeting in for an hour before that to discuss the questions that we had. 
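A rough sketch of the F_ zero measure discussed above, assuming the baseline is simply the mean F_ zero of one speaker over the whole meeting and that a segment is scored by the ratio of its own mean to that baseline. The variance within a segment is included as the alternative that was floated. Function names and sample values are hypothetical.

```python
from statistics import mean, pvariance


def relative_f0(segment_f0, speaker_baseline):
    """Ratio of the segment's mean F0 to the speaker's meeting-wide mean."""
    return mean(segment_f0) / speaker_baseline


def f0_variance(segment_f0):
    """Population variance of F0 inside one segment."""
    return pvariance(segment_f0)


# Made-up F0 samples in Hz for one speaker in one meeting.
meeting_f0 = [100.0, 120.0, 140.0, 120.0]
baseline = mean(meeting_f0)          # one value per speaker per meeting
segment_f0 = [130.0, 134.0]
ratio = relative_f0(segment_f0, baseline)
spread = f0_variance(segment_f0)
print(ratio, spread)
```

A ratio above one would mean the speaker is higher than their own average in that segment, which avoids comparing speakers with different natural pitch.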
Yeah, I haven't gotten like sort of my confirmation that w Wednesday is fine. I'm not sure if I'm supposed to expect a confirmation for my confirmation from him now or but I'll just email him again and s ask him if we can maybe make it one sorry, we said twelve and I'm asking if he can make one. This is a bit frustrating at the moment, this project, isn't it? It's so like difficult to to get to the point where you understand enough to really feel that. I don't know, I'm not feeling that I'm really working at the moment, I'm more just trying to make sense of everything. And it's a bit too far into the meeting for that, and into the project for that. Hmm. I guess as soon as we have a framework in the w in the type of the prototype where like sort of each of us can tie in their stuff and see what it l how how it looks like and how it performs, you know. That probably makes it a lot easier then, but it's sort of it's a boot-strapping problem, like for the prototype we need some type of data, but to develop the data it would be a lot easier to to have the prototype. Anyway, I gotta go. Hmm? Yes, but if it's if it's in a form which is easy to read in at the moment, that would be fine. Sort of like if basically if we have something like this segments file, but for each of you like just have one attribute. I think it's really easy if we don't merge them before-hand, but if we let them if we combine them in the prototype or don't at the moment, because then we can easily d display them individually, c contrast them to each other and play around with how y to combine them. Exactly, yeah. Yeah. And I mean computationally multiplying two integers or doubles or whatever shouldn't be the thing that slows us down. Yeah. Alright.
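A small sketch of combining the two per-segment values inside the prototype rather than in the file, assuming each segment carries a value A_ and a value B_. Both a weighted mix and a plain product are shown, since both were mentioned; the weight parameter, the function name and the example scores are hypothetical.

```python
def combine(value_a, value_b, weight_a=0.5, mode="weighted"):
    """Combine two per-segment scores.

    mode="weighted": weight_a * A + (1 - weight_a) * B
    mode="product":  A * B
    """
    if mode == "product":
        return value_a * value_b
    return weight_a * value_a + (1.0 - weight_a) * value_b


# Hypothetical per-segment scores kept separate until display time.
segment_scores = {"seg.1": (0.2, 0.8), "seg.2": (0.9, 0.5)}
combined = {seg_id: combine(a, b, weight_a=0.5)
            for seg_id, (a, b) in segment_scores.items()}
print(combined)
```

Keeping the values separate until this point means the weighting can be changed interactively and each score can still be displayed on its own.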